6." JAN. 2006 13:29 0 YOUNG NO. 2478 P. 4/34- 

10/564103 

WO 2005/005006 PCT/GB2004/002991

IAP28 Esst rm'TO 0 S J AN 2006 



TIMING OFFSET TOLERANT KARAOKE GAME



This invention relates to electronic game processing. A particular example
involves the control of video game processing operations, but the invention has more
general application to other types of electronic game processing.

In a conventional video games machine, a user views the game on a video monitor
or television screen, and controls operation of the game using a hand-held keypad or
joystick. With some games machines such as the Sony® PlayStation® 2, a handheld
controller provides two joysticks and several user-operated keys, along with a vibrating
element to provide tactile feedback to the user of events occurring within the game.

Some games require the user to carry out certain actions in response to an
indication (e.g. on a display) of a target action. The user is judged on the accuracy of his
response to the target actions. For example, the system might test whether the user
carried out the correct action, or whether the user carried out the action at the correct time.
An example of such a game is a so-called karaoke game, in which the user is presented
with a series of words to be sung to form a song. A backing track for the song is played
on a loudspeaker, and the user sings along into a microphone. The user may be judged on
(for example) the pitch and timing of his singing.

This invention provides game processing apparatus comprising:
means for indicating successive target actions to be executed by a user, each target
action having an associated target time of execution; and
scoring logic in which detected user actions are compared with the target actions,
the scoring logic comprising:
o an input arrangement by which user actions may be detected;
o means for comparing a detected sequence of user actions with a sequence of
target actions; and
o means for detecting a timing offset between the sequence of user actions and a
corresponding sequence of target actions;
in which the comparison of subsequent user actions with respective target actions
is arranged to apply the timing offset as a relative displacement between the detected user
actions and the target times.






The invention recognises a potential problem in this type of target action - user
action game, which is that the user may complete a sequence of actions correctly, with a
correct or good relative timing between the actions, but the whole sequence may be offset
[...]



PAGE 4134 ' RCVD AT 11612006 8:26:45 AM [Eastern Standard Time] ' SVR:NYC-US-FAX-01/11 ' DNIS:S743 ' CSID: ' DURATION (inm-ss):1144 






[...] electronic game, because each individual user action would miss its respective target
time. The timing offset could be caused by the user's reaction time or, in the case of a
so-called "karaoke" game, by the user's poor sense of musical timing.

The invention assesses from a sequence of target actions and user actions, such as
an initial sequence, whether such a timing offset exists. The timing offset is then applied
as a timing correction to subsequent user actions, in an attempt to avoid penalising users
suffering from this timing problem.

The invention is particularly (though not exclusively) applicable to singing or
karaoke type games, in which the target actions indicate a required musical note; the user
actions involve the user singing a musical note; and the input arrangement comprises a
microphone. In such a game, it is preferred that the scoring logic is operable to detect that
a user has successfully carried out a target action if a musical note generated by the user is
within a tolerance amount (e.g. pitch amount) of the corresponding target musical note.

Preferably, where the target actions indicate a required word to be sung and the
user actions involve singing the required word, the scoring logic is operable to vary the
tolerance amount in dependence on the required word. Preferably the scoring logic is
operable to not attempt scoring in respect of a predefined set of words or syllables.
Examples of such words are those including the English "ss" or [...] sounds, for which it
is relatively difficult to assess the pitch of the user's voice accurately.

To avoid penalising users with different vocal registers, it is preferred that the
scoring logic is arranged to detect a difference in tone between a target musical note and
the octave-multiple of a user-generated musical note which is closest in tone to the target
musical note.

Preferably the target actions are arranged as successive groups of target actions,
the groups being separated by pauses in which no user action is expected. This is useful
to give the user a break in many types of game, but in the case of a karaoke type of game 
it corresponds to the normal flow of a song as a series of lines. 

Because the user's timing may alter at such a break, it is preferred that the scoring
logic is arranged to detect the timing offset after each pause.

Preferably the scoring logic is arranged to detect the correlation between the
sequence of user actions and the sequence of target actions at two or more possible values
of the timing offset, and to set the timing offset to be that one of the possible values for
which the detected correlation is greatest.
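
By way of illustration only, the selection of a timing offset by correlation might be
sketched as follows in Python. The function names, the representation of each sequence
as one midi-note value per note clock period (0 meaning a pause), the exact-match test
and the candidate offsets of -2 to +2 periods are all assumptions made for this sketch
and are not taken from the embodiment.

```python
def correlation(target, sung, offset):
    """Count target periods whose note matches the sung note `offset` periods later.

    `target` and `sung` hold one midi-note value per note clock period;
    0 is treated as a pause and never counts as a match (assumption).
    """
    score = 0
    for i, wanted in enumerate(target):
        j = i + offset  # a late singer produces target period i at period i + offset
        if wanted != 0 and 0 <= j < len(sung) and sung[j] == wanted:
            score += 1
    return score


def detect_timing_offset(target, sung, candidates=(-2, -1, 0, 1, 2)):
    """Return the candidate offset giving the greatest correlation.

    Ties are resolved in favour of the smallest displacement, which is a
    design choice of this sketch rather than something stated in the text.
    """
    return max(candidates,
               key=lambda off: (correlation(target, sung, off), -abs(off)))


if __name__ == "__main__":
    target = [60, 62, 64, 65, 0, 0]
    late = [0, 60, 62, 64, 65, 0]   # the same notes, one period late
    print(detect_timing_offset(target, late))
```

In this sketch a positive offset means that the user is late relative to the target
times; the detected value would then be used when comparing later notes.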






As examples, the target times of execution could define start times in respect of
the associated target actions and/or could define durations in respect of the associated
target actions.

The invention also provides a method of game processing in which user actions
are compared with target actions, the method comprising:
indicating successive target actions to be executed by a user, each target action
having an associated target time of execution;
detecting user actions;
comparing a detected sequence of user actions with a sequence of target actions;
detecting a timing offset between the sequence of user actions and a corresponding
sequence of target actions;
in which the comparison of subsequent user actions with respective target actions
is arranged to apply the timing offset as a relative displacement between the detected user
actions and the target times.
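
The final step of the method, in which the offset is applied as a relative displacement
before later actions are compared, might be sketched as follows. This is a minimal
sketch only: the use of start times in seconds, the pairing of actions by position, the
tolerance parameter and the function name are assumptions of the sketch.

```python
def timed_hits(target_times, user_times, offset, tolerance):
    """Judge each user action against its target time after removing `offset`.

    `offset` is the previously detected lateness of the user in seconds
    (negative if the user is early). Actions are paired by position, which
    is a simplification made for this sketch.
    """
    return [abs((user - offset) - target) <= tolerance
            for target, user in zip(target_times, user_times)]


if __name__ == "__main__":
    targets = [1.0, 2.0, 3.0]
    actions = [1.4, 2.4, 3.5]          # consistently about 0.4 s late
    print(timed_hits(targets, actions, offset=0.0, tolerance=0.05))
    print(timed_hits(targets, actions, offset=0.4, tolerance=0.05))
```

Without the correction every action in the example misses its target time; with the
detected offset applied, only the action whose relative timing is actually wrong is
judged a miss.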

This invention also provides computer software having program code for carrying
[...] medium such as a transmission medium or a storage medium.

Further respective aspects and features of the invention are defined in the
appended claims.

Embodiments of the invention will now be described, by way of example only,
with reference to the accompanying drawings in which:

Figure 1 schematically illustrates the overall system architecture of the
PlayStation2;

Figure 2 schematically illustrates the architecture of an Emotion Engine; 
Figure 3 schematically illustrates the configuration of a Graphics Synthesiser;
Figure 4 is a schematic diagram illustrating the logical functionality of the 
PlayStation2 in respect of an embodiment of the invention; 
Figure 5 schematically illustrates a part of a song file; 
Figure 6 schematically illustrates a screen display; 
Figure 7 schematically illustrates a song as sung by a singer with good timing; 
Figure 8 schematically illustrates a poorly timed song; and 
Figure 9 schematically illustrates a song with a corrective timing offset applied.




Figure 1 schematically illustrates the overall system architecture of the
PlayStation2. A system unit 10 is provided, with various peripheral devices connectable
to the system unit.

The system unit 10 comprises: an Emotion Engine 100; a Graphics Synthesiser
200; a sound processor unit 300 having dynamic random access memory (DRAM); a read
only memory (ROM) 400; a compact disc (CD) and digital versatile disc (DVD) reader
450; a Rambus Dynamic Random Access Memory (RDRAM) unit 500; an input/output
processor (IOP) 700 with dedicated RAM 750. An (optional) external hard disk drive
(HDD) 390 may be connected.

The input/output processor 700 has two Universal Serial Bus (USB) ports 715 and
an iLink or IEEE 1394 port (iLink is the Sony Corporation implementation of the IEEE 1394
standard). The IOP 700 handles all USB, iLink and game controller data traffic. For
example when a user is playing a game, the IOP 700 receives data from the game
controller and directs it to the Emotion Engine 100 which updates the current state of the
game accordingly. The IOP 700 has a Direct Memory Access (DMA) architecture to
facilitate rapid data transfer rates. DMA involves transfer of data from main memory to a
device without passing it through the CPU. The USB interface is compatible with Open
Host Controller Interface (OHCI) and can handle data transfer rates of between 1.5 Mbps
and 12 Mbps. Provision of these interfaces means that the PlayStation2 is potentially
compatible with peripheral devices such as video cassette recorders (VCRs), digital
cameras, microphones, set-top boxes, printers, keyboard, mouse and joystick.

Generally, in order for successful data communication to occur with a peripheral
device connected to a USB port 715, an appropriate piece of software such as a device
driver should be provided. Device driver technology is very well known and will not be
described in detail here, except to say that the skilled man will be aware that a device
driver or similar software interface may be required in the embodiment described here.

In the present embodiment, a USB microphone 730 is connected to the USB port.
The microphone includes an analogue-to-digital converter (ADC) and a basic
hardware-based real-time data compression and encoding arrangement, so that audio data are
transmitted by the microphone 730 to the USB port 715 in an appropriate format, such as
16-bit mono PCM (an uncompressed format) for decoding at the PlayStation 2 system
unit 10.

Apart from the USB ports, two other ports 705, 710 are proprietary sockets
allowing the connection of a proprietary non-volatile RAM memory card 720 for storing




game-related information, a hand-held game controller 725 or a device (not shown)
mimicking a hand-held controller, such as a dance mat.

The Emotion Engine 100 is a 128-bit Central Processing Unit (CPU) that has been
specifically designed for efficient simulation of 3 dimensional (3D) graphics for games
applications. The Emotion Engine components include a data bus, cache memory and
registers, all of which are 128-bit. This facilitates fast processing of large volumes of
multi-media data. Conventional PCs, by way of comparison, have a basic 64-bit data
structure. The floating point calculation performance of the PlayStation2 is 6.2 GFLOPs.

The Emotion Engine also comprises MPEG2 decoder circuitry which allows for
simultaneous processing of 3D graphics data and DVD data. The Emotion Engine
performs geometrical calculations including mathematical transforms and translations and
also performs calculations associated with the physics of simulation objects, for example,
calculation of friction between two objects. It produces sequences of image rendering
commands which are subsequently utilised by the Graphics Synthesiser 200. The image
rendering commands are output in the form of display lists. A display list is a sequence of
drawing commands that specifies to the Graphics Synthesiser which primitive graphic
objects (e.g. points, lines, triangles, sprites) to draw on the screen and at which
co-ordinates. Thus a typical display list will comprise commands to draw vertices,
commands to shade the faces of polygons, render bitmaps and so on. The Emotion Engine
100 can asynchronously generate multiple display lists.

The Graphics Synthesiser 200 is a video accelerator that performs rendering of
the display lists produced by the Emotion Engine 100. The Graphics Synthesiser 200
includes a graphics interface unit (GIF) which handles, tracks and manages the multiple
display lists. The rendering function of the Graphics Synthesiser 200 can generate image
data that supports several alternative standard output image formats, i.e. NTSC/PAL,
High Definition Digital TV and VESA. In general, the rendering capability of graphics
systems is defined by the memory bandwidth between a pixel engine and a video
memory, each of which is located within the graphics processor. Conventional graphics
systems use external Video Random Access Memory (VRAM) connected to the pixel
logic via an off-chip bus which tends to restrict available bandwidth. However, the
Graphics Synthesiser 200 of the PlayStation2 provides the pixel logic and the video
memory on a single high-performance chip which allows for a comparatively large 38.4
Gigabyte per second memory access bandwidth. The Graphics Synthesiser is
theoretically capable of achieving a peak drawing capacity of 75 million polygons per




second. Even with a full range of effects such as textures, lighting and transparency, a
sustained rate of 20 million polygons per second can be drawn continuously.
Accordingly, the Graphics Synthesiser 200 is capable of rendering a film-quality image.

The Sound Processor Unit (SPU) 300 is effectively the soundcard of the system
which is capable of recognising 3D digital sound such as Digital Theater Surround
(DTS®) sound and AC-3 (also known as Dolby Digital) which is the sound format used
for digital versatile disks (DVDs).

A display and sound output device 305, such as a video monitor or television set
with an associated loudspeaker arrangement 310, is connected to receive video and audio
signals from the graphics synthesiser 200 and the sound processing unit 300.

The main memory supporting the Emotion Engine 100 is the RDRAM (Rambus
Dynamic Random Access Memory) module 500 produced by Rambus Incorporated. This
RDRAM memory subsystem comprises RAM, a RAM controller and a bus connecting
the RAM to the Emotion Engine 100.

Figure 2 schematically illustrates the architecture of the Emotion Engine 100 of
Figure 1. The Emotion Engine 100 comprises: a floating point unit (FPU) 104; a central
processing unit (CPU) core 102; vector unit zero (VU0) 106; vector unit one (VU1) 108; a
graphics interface unit (GIF) 110; an interrupt controller (INTC) 112; a timer unit 114; a
direct memory access controller 116; an image data processor unit (IPU) 118; a dynamic
random access memory controller (DRAMC) 120; a sub-bus interface (SIF) 122; and all
of these components are connected via a 128-bit main bus 124.

The CPU core 102 is a 128-bit processor clocked at 300 MHz. The CPU core has
access to 32 MB of main memory via the DRAMC 120. The CPU core 102 instruction
set is based on MIPS III RISC with some MIPS IV RISC instructions together with
additional multimedia instructions. MIPS III and IV are Reduced Instruction Set
Computer (RISC) instruction set architectures proprietary to MIPS Technologies, Inc.
Standard instructions are 64-bit, two-way superscalar, which means that two instructions
can be executed simultaneously. Multimedia instructions, on the other hand, use 128-bit
instructions via two pipelines. The CPU core 102 comprises a 16KB instruction cache, an
8KB data cache and a 16KB scratchpad RAM which is a portion of cache reserved for
direct private usage by the CPU.

The FPU 104 serves as a first co-processor for the CPU core 102. The vector unit
106 acts as a second co-processor. The FPU 104 comprises a floating point product sum
arithmetic logic unit (FMAC) and a floating point division calculator (FDIV). Both the




FMAC and FDIV operate on 32-bit values so when an operation is carried out on a
128-bit value (composed of four 32-bit values) an operation can be carried out on all four
parts concurrently. For example adding 2 vectors together can be done at the same time.

The vector units 106 and 108 perform mathematical operations and are essentially
specialised FPUs that are extremely fast at evaluating the multiplication and addition of
vector equations. They use Floating-Point Multiply-Adder Calculators (FMACs) for
addition and multiplication operations and Floating-Point Dividers (FDIVs) for division
and square root operations. They have built-in memory for storing micro-programs and
interface with the rest of the system via Vector Interface Units (VIFs). Vector Unit Zero
106 can work as a coprocessor to the CPU core 102 via a dedicated 128-bit bus 124 so it
is essentially a second specialised FPU. Vector Unit One 108, on the other hand, has a
dedicated bus to the Graphics Synthesiser 200 and thus can be considered as a completely
separate processor. The inclusion of two vector units allows the software developer to
split up the work between different parts of the CPU and the vector units can be used in
either serial or parallel connection.

Vector unit zero 106 comprises 4 FMACS and 1 FDIV. It is connected to the
CPU core 102 via a coprocessor connection. It has 4 Kb of vector unit memory for data
and 4 Kb of micro-memory for instructions. Vector unit zero 106 is useful for performing
physics calculations associated with the images for display. It primarily executes
non-patterned geometric processing together with the CPU core 102.

Vector unit one 108 comprises 5 FMACS and 2 FDIVs. It has no direct path to the
CPU core 102, although it does have a direct path to the GIF unit 110. It has 16 Kb of
vector unit memory for data and 16 Kb of micro-memory for instructions. Vector unit
one 108 is useful for performing transformations. It primarily executes patterned
geometric processing and directly outputs a generated display list to the GIF 110.

The GIF 110 is an interface unit to the Graphics Synthesiser 200. It converts data
according to a tag specification at the beginning of a display list packet and transfers
drawing commands to the Graphics Synthesiser 200 whilst mutually arbitrating multiple
transfer. The interrupt controller (INTC) 112 serves to arbitrate interrupts from peripheral
devices, except the DMAC 116.

The timer unit 114 comprises four independent timers with 16-bit counters. The
[...] clock. The DMAC 116 handles data transfers between main memory and peripheral
processors or main memory and the scratch pad memory. It arbitrates the main bus 124 at



the same time. Performance optimisation of the DMAC 116 is a key way by which to
improve Emotion Engine performance. The image processing unit (IPU) 118 is an image
data processor that is used to expand compressed animations and texture images. It
performs I-PICTURE Macro-Block decoding, colour space conversion and vector
quantisation. Finally, the sub-bus interface (SIF) 122 is an interface [...]
It has its own memory and bus to control I/O devices such as sound chips and storage
devices.

Figure 3 schematically illustrates the configuration of the Graphics Synthesiser
200. The Graphics Synthesiser comprises: a host interface 202; a set-up / rasterizing unit
204; a pixel pipeline 206; a memory interface 208; a local memory 212 including a frame
page buffer 214 and a texture page buffer 216; and a video converter 210.

The host interface 202 transfers data with the host (in this case the CPU core 102
of the Emotion Engine 100). Both drawing data and buffer data from the host pass
through this interface. The output from the host interface 202 is supplied to the graphics
synthesiser 200 which develops the graphics to draw pixels based on vertex information
received from the Emotion Engine 100, and calculates information such as RGBA value,
depth value (i.e. Z-value), texture value and fog value for each pixel. The RGBA value
specifies the red, green, blue (RGB) colour components and the A (Alpha) component
represents opacity of an image object. The Alpha value can range from completely
transparent to totally opaque. The pixel data is supplied to the pixel pipeline 206 which
performs processes such as texture mapping, fogging and Alpha-blending and determines
the final drawing colour based on the calculated pixel information.

The pixel pipeline 206 comprises 16 pixel engines PE1, PE2 ... PE16 so that it can
process a maximum of 16 pixels concurrently. The pixel pipeline 206 runs at 150MHz
with 32-bit colour and a 32-bit Z-buffer. The memory interface 208 reads data from and
writes data to the local Graphics Synthesiser memory 212. It writes the drawing pixel
values (RGBA and Z) to memory at the end of a pixel operation and reads the pixel values
of the frame buffer 214 from memory. These pixel values read from the frame buffer 214
are used for pixel test or Alpha-blending. The memory interface 208 also reads from
local memory 212 the RGBA values for the current contents of the frame buffer. The
local memory 212 is a 32 Mbit (4MB) memory that is built-in to the Graphics Synthesiser
200. It can be organised as a frame buffer 214, texture buffer 216 and a 32-bit Z-buffer
215. The frame buffer 214 is the portion of video memory where pixel data such as
colour information is stored.




The Graphics Synthesiser uses a 2D to 3D texture mapping process to add visual
detail to 3D geometry. Each texture may be wrapped around a 3D image object and is
stretched and skewed to give a 3D graphical effect. The texture buffer is used to store the
texture information for image objects. The Z-buffer 215 (also known as depth buffer) is
the memory available to store the depth information for a pixel. Images are constructed
from basic building blocks known as graphics primitives or polygons. When a polygon is
rendered with Z-buffering, the depth value of each of its pixels is compared with the
corresponding value stored in the Z-buffer. If the value stored in the Z-buffer is greater
than or equal to the depth of the new pixel value then this pixel is determined visible so
that it should be rendered and the Z-buffer will be updated with the new pixel depth. If
however the Z-buffer depth value is less than the new pixel depth value the new pixel
value is behind what has already been drawn and will not be rendered.
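
The depth test described above can be illustrated with a short Python sketch. This is
illustrative only and makes no claim about how the Graphics Synthesiser implements the
test in hardware; the function name and the use of nested lists for the frame buffer and
Z-buffer are assumptions of the sketch.

```python
def draw_pixel(frame, zbuf, x, y, colour, depth):
    """Apply the Z-buffer test to one pixel and report whether it was drawn.

    A smaller depth value is nearer the viewer, matching the comparison
    described in the text: the pixel is drawn when the stored depth is
    greater than or equal to the new depth.
    """
    if zbuf[y][x] >= depth:
        frame[y][x] = colour
        zbuf[y][x] = depth
        return True
    return False  # the new pixel lies behind what has already been drawn


if __name__ == "__main__":
    frame = [[None]]
    zbuf = [[float("inf")]]          # empty buffer: everything is visible
    print(draw_pixel(frame, zbuf, 0, 0, "red", 5.0))
    print(draw_pixel(frame, zbuf, 0, 0, "blue", 9.0))
    print(frame[0][0])
```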

The local memory 212 has a 1024-bit read port and a 1024-bit write port for
accessing the frame buffer and Z-buffer and a 512-bit port for texture reading. The video
converter 210 is operable to display the contents of the frame memory in a specified
output format.

Figure 4 is a schematic diagram illustrating the logical functionality of the
PlayStation2 in respect of an embodiment of the invention. The functions of blocks
shown in Figure 4 are of course carried out, mainly by execution of appropriate software,
by parts of the PlayStation2 as shown in Figure 1, the particular parts of the PlayStation2
concerned being listed below. The software could be provided from disk or ROM
storage, and/or via a transmission medium such as an internet connection.

To execute a karaoke game, control logic 800 initiates the replay of an audio
backing track from a disk storage medium 810. The audio replay is handled by replay
logic 820 and takes place through an amplifier 307 forming part of the television set 305,
and the loudspeaker 310 also forming part of the television set.

The replay or generation of a video signal to accompany the audio track is also
handled by the replay logic 820. Background images may be stored on the disk 810 or
may instead be synthesised. Graphical overlays representing the lyrics to be sung and an
indication of pitch and timing are also generated in response to data from a song file 830
to be described below. The output video signal is displayed on the screen of the television
set 305.

The microphone 730 is also connected to the replay logic 820. The replay logic
820 converts the digitised audio signal from the microphone back into an analogue signal 




and supplies it to the amplifier 307 so that the user can hear his own voice through the
loudspeaker 310.

The song file will now be described.

The song file 830 stores data defining each note which the user has to sing to
complete a current song. An example of a song file is shown schematically in Figure 5.
The song file is expressed in XML format and starts with a measure of the song's tempo,
expressed in beats-per-minute. The next term is a measure of resolution, i.e. what fraction
of a beat is used in the note duration figures appearing in that song file. In the example of
Figure 5, the resolution is "semiquaver" and the tempo is 96 beats per minute, which
means that a note duration value of "1" corresponds to a quarter of a beat, or in other
words 1/384 of a minute.

A number of so-called "song elements" follow. The first of these is a "sentence
marker". Sentence markers are used to divide the song into convenient sections, which
will often coincide with sentences or phrases in the lyrics. A sentence break is often
(though not always) associated with a pause during which the user is not expected to sing.
The pause might represent the opening bars of a song, a gap between successive lines or
phrases, or the closing bars of the song. The sentence marker does not of itself define
how long the pause is to be. This in fact is defined by the immediately following song
element which sets a pitch or "midi-note" value of zero (i.e. no note) for a particular
duration.

In the subsequent song elements, where the midi-note value is non-zero this
represents a particular pitch at which the user is expected to sing. In the midi-scale,
middle C is represented by numerical value 60. The note A above middle C, which has a
standard frequency of 440Hz, is represented by midi number 69. Each octave is
represented by a span of 12 in the midi scale, so (for example) top C (C above middle C)
is represented by midi-note value 72.

It should be noted that in some systems, the midi-note value 0 is assigned to
bottom C (about 8.175Hz), but in the present embodiment the midi-note value 0 indicates
a pause with no note expected.
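
The figures quoted above (69 for 440Hz, 12 steps per octave, about 8.175Hz for value 0)
are consistent with the usual equal-tempered mapping between midi-note values and
frequency. A minimal Python sketch of that mapping follows; the function names are
illustrative, and the assumption of equal temperament is made for the sketch rather than
stated in the embodiment.

```python
import math

A4_MIDI = 69      # the A above middle C
A4_HZ = 440.0     # its standard frequency


def midi_to_hz(note):
    """Frequency of a midi-note value, assuming 12-tone equal temperament."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)


def hz_to_midi(freq_hz):
    """Fractional midi-note value of a detected frequency (must be > 0)."""
    return A4_MIDI + 12.0 * math.log2(freq_hz / A4_HZ)


if __name__ == "__main__":
    print(round(midi_to_hz(60), 2))     # middle C
    print(round(midi_to_hz(0), 3))      # bottom C in systems that use value 0
    print(round(hz_to_midi(523.25)))    # C above middle C
```

In the present embodiment the value 0 would be intercepted as a pause before any such
conversion is attempted.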

Each non-zero midi-note value has an associated lyric. This might be a part of a
word, a whole word or even (in some circumstances) more than one word. It is possible
for the word to be empty, e.g. (in XML) LYRICS'" 

So, the song elements each define a pitch, a duration and a lyric. 
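
A song element as summarised above could be modelled in memory as follows. This is a
hedged sketch: the class name, field names and the `expand` helper are hypothetical and
do not reproduce the XML layout of Figure 5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SongElement:
    """One song element: a pitch, a duration and a lyric (names are illustrative)."""
    midi_note: int      # 0 indicates a pause in the present embodiment
    duration: int       # counted in note clock periods
    lyric: str = ""     # may be empty

    @property
    def is_pause(self):
        return self.midi_note == 0


def expand(elements):
    """Return one target midi-note value per note clock period."""
    per_period = []
    for element in elements:
        per_period.extend([element.midi_note] * element.duration)
    return per_period


if __name__ == "__main__":
    song = [SongElement(0, 2), SongElement(60, 1, "la"), SongElement(62, 2, "la")]
    print(expand(song))
```

Expanding the elements to one value per period gives a sequence that can be compared
directly with pitch samples taken once per note clock period.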




In the present example, the song file defines the lyrics for display and the pitch
and note lengths that the user is expected to sing. It does not define the backing track
which the user will hear and which in fact will prompt the user to sing the song. The
backing track is recorded separately, for example as a conventional audio recording. This
arrangement means that the backing track replay and the reading of data from the song
file need to start at related times (e.g. substantially simultaneously), something which is
handled by the control logic 800. However, in other embodiments the song file could
define the backing track as well, for example as a series of midi-notes to be synthesised
into sounds by a midi synthesiser (not shown in Figure 4, but actually embodied within
the SPU 300 of Figure 1).


Returning to Figure 4, a note clock generator 840 reads the tempo and resolution
values from the song file 830 and provides a clock signal at the appropriate rate. In
particular, the rate is the tempo multiplied by the sub-division of each beat. So, for
example, for a tempo of 96 and a resolution of "semiquaver" (quarter-beat) the note clock
runs at 96 x 4 beats-per-minute, i.e. 384 beats-per-minute. If the tempo were 90 and the
resolution were "quaver" (half-beat) then the note clock would run at 180 (90 x 2)
beats-per-minute, and so on.
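
The note clock arithmetic above can be checked with a few lines of Python. The table of
resolutions holds only the two names used in the text, and the function names are
assumptions made for this sketch.

```python
# Sub-divisions of a beat for the two resolutions named in the text.
SUBDIVISIONS = {"quaver": 2, "semiquaver": 4}


def note_clock_per_minute(tempo_bpm, resolution):
    """Note clock rate: the tempo multiplied by the sub-division of each beat."""
    return tempo_bpm * SUBDIVISIONS[resolution]


def note_clock_period_seconds(tempo_bpm, resolution):
    """Length in seconds of one note clock period (a duration value of 1)."""
    return 60.0 / note_clock_per_minute(tempo_bpm, resolution)


if __name__ == "__main__":
    print(note_clock_per_minute(96, "semiquaver"))
    print(note_clock_per_minute(90, "quaver"))
    print(note_clock_period_seconds(96, "semiquaver"))
```

At a tempo of 96 with semiquaver resolution, one period is 1/384 of a minute, which
matches the duration unit given for the song file of Figure 5.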

The note clock is used to initiate reading out from the song file 830 of each song
element and also to control the detection of pitch of the user's voice.

With regard to pitch detection, the signal from the USB microphone 730 is also
supplied to a pitch detector 850. This operates during temporal windows defined by the
note clock to detect the pitch of the user's voice within those windows. In other words,
the pitch of the user's voice is sampled at a sampling rate equal to the frequency of the
note clock (384 times per minute in this example). The pitch detector uses known pitch
detection techniques such as those described in the paper "Pitch Determination of Human
Speech by the Harmonic Product Spectrum, the Harmonic Sum Spectrum, and a
Maximum Likelihood Estimate", A. Michael Noll, Bell Telephone Labs - presented at the
Symposium on Computer Processing in Communications, Polytechnic Institute of
Brooklyn, April 8-10, 1969. The detected pitch is converted to a numerical value on the
midi-note scale as described above. The pitch value is supplied to a buffer register 860
and from there to a comparator 870. The corresponding midi-note value from the song
file 830 is also read out, under the control of the note clock, and is supplied to another
buffer register 880 before being passed to the comparator 870.




The register 880 buffers midi-note values from the song file. The register 860
stores the detected midi-note values generated in respect of the user's singing over
a certain number of consecutive note clock periods. These data are used in a "correlation
test" to be described below.

The comparator is arranged to compare the midi-note value from the song file with
the midi-note value representing the detected pitch. Where a song element in the note file
represents a note having a duration greater than 1, the absolute differences between the
detected pitch values and correct pitch values from the song file are averaged. The
comparator operates with modulo 12 arithmetic. This means that an error of a multiple of
12 in the user's pitch (i.e. an error of one or more whole octaves) will be disregarded in
the assessment of how close the user has come to the correct note, and errors of greater
than one octave are treated as if the detected pitch was in fact in the same octave as the
correct note. This ensures that users are not penalised in respect of the register or normal
range of their voice.
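
By way of example only, one possible reading of the modulo 12 comparison is sketched
below. The sketch assumes that the error is folded onto the nearest octave, giving a
signed value of at most six semitones in magnitude; that folding convention and the
function names are assumptions rather than features stated above.

```python
from typing import Sequence


def octave_folded_error(sung: float, target: float) -> float:
    """Signed pitch error in semitones with whole octaves disregarded.

    Positive means sharp, negative means flat (assumed convention).
    """
    diff = (sung - target) % 12.0  # now in the range [0, 12)
    if diff > 6.0:
        diff -= 12.0               # nearer to the target from below
    return diff


def mean_abs_error(sung_samples: Sequence[float], target: float) -> float:
    """Average absolute error over the note clock periods of one song element."""
    if not sung_samples:
        raise ValueError("at least one sample is required")
    return sum(abs(octave_folded_error(s, target)) for s in sung_samples) / len(sung_samples)
```

With this reading a user singing exactly one octave above the target has an error of
zero, and a user singing thirteen semitones below it is treated as one semitone flat.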

For scoring purposes, the comparator detects whether the user has achieved a pitch
within a certain threshold amount of the required pitch. The threshold amount is
expressed on the midi-note scale, so the threshold might be for example ± 2.5 on the
midi-note scale. (Although notes are expressed only as integers on the midi-note scale,
fractional thresholds do have meaning in the present embodiment because of the
averaging of the sampled values of the user's pitch over the course of the duration of a
song element). The threshold may be set in response to a game difficulty level - so that
the threshold is greater for an "easy" level and smaller for a "hard" level.
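
The following sketch illustrates why a fractional threshold is meaningful once the
per-period errors are averaged. Only the value of 2.5 appears in the description; the
"easy" and "hard" values, the "medium" label and the function name are hypothetical.

```python
from typing import Sequence

# Threshold on the midi-note scale per difficulty level.
# 2.5 is the example given in the text; 3.5 and 1.5 are illustrative assumptions.
THRESHOLDS = {"easy": 3.5, "medium": 2.5, "hard": 1.5}


def note_is_hit(abs_errors: Sequence[float], level: str = "medium") -> bool:
    """True if the averaged absolute pitch error is within the level's threshold."""
    if not abs_errors:
        return False  # nothing was sampled for this song element
    average = sum(abs_errors) / len(abs_errors)
    return average <= THRESHOLDS[level]
```

For a note lasting two note clock periods with errors of 2 and 3, the average is 2.5,
which just satisfies a threshold of ± 2.5 but not a smaller one.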

Optionally the comparator may exclude certain notes from being assessed, in
response to the current lyric. This is because some lyrics, such as those containing "ss" or
similar sounds, make it hard to detect the pitch of the user's voice accurately. So in order to
avoid penalising the user's score, comparisons will not be attempted when such a word is
being sung. The presence of such a word can be detected by comparing the current lyric
in the song file with a list of "difficult" words, or by detecting certain letter patterns in the
current lyric, or by a flag being set in the song element.
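
The three detection options just described may be sketched as below. The word list,
the letter pattern and the flag name are all hypothetical placeholders chosen for
illustration; the description does not specify them.

```python
import re

# Hypothetical examples only; the actual list and patterns are not specified.
DIFFICULT_WORDS = {"kiss", "hush"}
DIFFICULT_PATTERN = re.compile(r"ss", re.IGNORECASE)


def should_skip(lyric: str, skip_flag: bool = False) -> bool:
    """Decide whether pitch comparison should be skipped for the current lyric."""
    word = lyric.strip().lower()
    if skip_flag:                      # flag set in the song element
        return True
    if word in DIFFICULT_WORDS:        # list of "difficult" words
        return True
    return bool(DIFFICULT_PATTERN.search(word))  # letter pattern in the lyric
```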

Score generation logic 890 receives the results of the comparison by the
comparator 870 and generates a user score from them.

The score generation logic also controls the output of values by the registers 860,
880 to the comparator 870 so as to carry out a correlation test to detect a time offset - i.e. a
delay or advance between the target time for singing a note and the actual time at which
the user sings the note - and subsequently to implement that time offset during scoring of
the user. For example, the time offset can be detected at the beginning of a "sentence" in
the song, and that time offset used during the remainder of that sentence.

To explain how this operation works, first consider an example set of three
consecutive notes defined by the song file, being the first three notes of a "sentence" in
the song. It will be seen that for this example, the total length of the three notes is six note
clock periods.



    Note    Pitch    Target start time    Length (measured in note clock periods)
    N1      P1       t0                   2
    N2      P2       t2                   1
    N3      P3       t3                   3



The user's pitch (i.e. the pitch of any input from the user via the microphone) is
detected at each note clock period and is buffered in the register 860. The comparator 870
then carries out a set of comparisons, in series, in parallel or in another arrangement, so as
to compare the user's contribution with the required notes at ±3 note clock periods away
from the target start time of the sequence. This means that only the first two notes, N1
and N2, are considered in the process to set the timing offset, because their total length
adds up to three clock periods. So, if the user were to sing in perfect time, the expected
comparison of pitch would be as follows:



    detected pitch at time t-3
    detected pitch at time t-2
    detected pitch at time t-1
    detected pitch at time t0     =>    compare with P1
    detected pitch at time t1     =>    compare with P1
    detected pitch at time t2     =>    compare with P2
    detected pitch at time t3
    detected pitch at time t4
    detected pitch at time t5
    detected pitch at time t6



However, the present embodiment recognises that the user may not necessarily
sing in perfect time, but could start singing early or late. The comparator 870 therefore
carries out a series of comparisons over a range of delay/advance periods of ±3 note clock
periods. At one extreme, i.e. to test whether the user has started singing three note clock
periods too early, such a comparison might look like:



    detected pitch at time t-6
    detected pitch at time t-5
    detected pitch at time t-4
    detected pitch at time t-3    =>    compare with P1
    detected pitch at time t-2    =>    compare with P1
    detected pitch at time t-1    =>    compare with P2
    detected pitch at time t0
    detected pitch at time t1
    detected pitch at time t2
    detected pitch at time t3
    detected pitch at time t4
    detected pitch at time t5
    detected pitch at time t6



At the other extreme, i.e. to test whether the user has started singing three note clock
periods too late, a comparison might look like:



    detected pitch at time t-2
    detected pitch at time t-1
    detected pitch at time t0
    detected pitch at time t1
    detected pitch at time t2
    detected pitch at time t3     =>    compare with P1
    detected pitch at time t4     =>    compare with P1
    detected pitch at time t5     =>    compare with P2
    detected pitch at time t6
    detected pitch at time t7



A threshold amount similar to that described above is used during the correlation
test to detect the time offset. The correlation test searches for the sung notes which are
closest to the pitch of the required notes and their associated timing, but they are
disregarded for these purposes if they differ from the required notes by more than a
threshold amount of 2 semitones (2 MIDI note values). (A variable threshold is also used,
as described above, in a subsequent analysis of how well the user sang, taking the time
offset into account, in order to score that user.)

So, in the present example, a total of 7 such comparisons are carried out. Each
comparison will provide three results, i.e. the pitch errors in respect of the three note clock
periods in the test group. In one embodiment these three results are combined (e.g.
added) to give a single output value in respect of each possible delay/advance period. In
another embodiment the pitch of only the first note clock period is considered.
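
As a minimal sketch of the embodiment in which the three results are added, the
correlation test may be expressed as follows. It assumes that a detected pitch is
available for every note clock period in the search window, and that each error is
folded modulo 12; the data layout and the function names are hypothetical.

```python
from typing import Dict, Mapping, Sequence


def folded_abs_error(sung: float, target: float) -> float:
    """Absolute pitch error in semitones, whole octaves disregarded."""
    diff = (sung - target) % 12
    return min(diff, 12 - diff)


def correlation_outputs(
    sung_by_time: Mapping[int, float],
    targets: Sequence[float],
    max_offset: int = 3,
) -> Dict[int, float]:
    """Summed pitch error for each candidate delay/advance period.

    sung_by_time maps a note clock index (0 = target start time t0) to the
    detected midi-note value; targets holds the required pitch for each note
    clock period of the test group, e.g. [P1, P1, P2].
    """
    outputs = {}
    for offset in range(-max_offset, max_offset + 1):
        outputs[offset] = sum(
            folded_abs_error(sung_by_time[offset + i], target)
            for i, target in enumerate(targets)
        )
    return outputs
```

With a range of ±3 the function returns 7 output values, one per comparison; in this
representation the lowest value indicates the closest match.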




Among the set of outputs from the correlation test there will normally be a single
correlation "peak", that is to say a value of the delay/advance which gives the greatest
correlation between the user's pitch for the three note clock periods (or the first note clock
period) and the target pitch. The skilled person will understand that depending on the way
in which the output values are represented, this could be the highest output value or the
lowest output value.

Assuming such a single peak exists, then that value is carried forward for use as a
time offset over the remainder of that sentence. For example, if it is found that the peak
occurs when the offset is -3 note clock periods (i.e. the user started singing 3 note clock
periods too early), then for the remainder of that sentence all further comparisons are
between the target pitch and the pitch that the user sang three clock periods earlier. The
offset is reset at the next sentence marker in the song file, and the correlation test
described above is repeated.
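
By way of illustration, applying a detected offset during the remainder of a sentence
amounts to reading the buffered pitch from a shifted position. The sketch assumes the
buffered pitches are held in a mapping keyed by note clock index; the names are
hypothetical.

```python
from typing import Mapping, Optional


def sung_pitch_for(
    sung_by_tick: Mapping[int, float], target_tick: int, offset: int
) -> Optional[float]:
    """Detected pitch to compare against the target at target_tick.

    A negative offset means the user sang early, so the sample is taken from
    earlier in the buffer; None is returned if no sample was buffered there.
    """
    return sung_by_tick.get(target_tick + offset)
```

With an offset of -3, the target at note clock index 10 is compared with the pitch that
was detected at index 7.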

The detection of correlation can be applied to the three clock periods, as described
above, or can be applied just to a search for the closest match to the first note in the target
sequence.

If the player's singing is early, then the closest match between the sung notes and
the required notes is chosen. If the closest match is not close enough, i.e. it is more than
the threshold amount of 2 semitones different, the offset is set to 0 and no timing
correction is made. If more than one note matches equally well, the oldest sung note
having a pitch matching the first note of the target sequence may be used to decide the
offset. If the singing is late, we simply find the closest note. If two notes match, or if there
are two correlation "peaks", then a convention is established to choose the earliest note or
peak (i.e. the longest ago in time).

A generalisation of this is that if there is no peak at all (which could happen if the
user was not singing at all) then an offset of zero is used for that sentence.
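
The selection rules set out above may be sketched as follows for the variant in which
only the first note is matched. The sketch assumes that each output value is a pitch
error in semitones (so that lower is better) and that "earliest" corresponds to the most
negative offset; the function name is hypothetical.

```python
from typing import Mapping


def choose_offset(outputs: Mapping[int, float], threshold: float = 2.0) -> int:
    """Pick the time offset from a mapping of offset -> pitch error.

    Returns 0 when there is no usable peak, i.e. no outputs at all or a best
    error above the threshold of 2 semitones. Ties go to the earliest offset.
    """
    if not outputs:
        return 0
    best = min(outputs.values())
    if best > threshold:
        return 0
    return min(offset for offset, error in outputs.items() if error == best)
```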

At the start of a new sentence, the comparator can detect the pitches of the first
three note clock periods of that sentence, excluding a first note (if present) for which the
midi-note value is zero. If there are fewer than three note clock periods in the sentence,
then the total of all of the notes is detected. This then determines the number of
comparisons which have to be made for the correlation test, and also the time at which the
last of those comparisons can be completed. Alternatively, particularly if the comparisons
were executed in hardware, it may be convenient to carry out a fixed number of
comparisons but to use only a certain number of them in the correlation test. In the
arrangement described above, the correlation test is carried out once, near to the
beginning of each sentence, and the detected offset period is carried forward for the rest of
that sentence. Of course there are many possible alternatives. For example, the
correlation test could be carried out for each newly sung note, or after each group of n
newly sung notes (e.g. every three notes or note clock periods), so as to allow the time
offset to be modified during the course of the sentence. In this case it is desirable to
include a filtering operation, to limit variation of the time offset during the sentence. For
example, in a simple arrangement, only a certain maximum variation (e.g. one note clock
period) could be allowed between successive detections of the time offset during a single
sentence.
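
One simple form of the filtering operation mentioned above is a clamp on the change
between successive detections, sketched below. The maximum step of one note clock
period is taken as an illustrative default and the function name is hypothetical.

```python
def filter_offset(previous: int, detected: int, max_step: int = 1) -> int:
    """Limit how far the time offset may move between successive detections."""
    step = detected - previous
    step = max(-max_step, min(max_step, step))  # clamp to [-max_step, +max_step]
    return previous + step
```

If the previous offset was 0 and a new correlation test suggests +3, the filtered
offset moves only to +1; repeated detections would let it reach +3 gradually.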

In a further variation, the time offset could be detected once, for example at the
beginning of a song, and then the same time offset used for the remainder of that song.

The correlation test could use features other than pitch. For example, known
speech detection logic could be used to detect the lyrics that the user should have sung,
with that detection being carried out over a range of ±3 note clock periods as described
above. Speech detection systems tend not to give an absolute answer as to whether a user
said a certain word, but rather give a probability value that the word was said. So, the
correlation test could seek the greatest combined probability over the range of possible
offset periods that the user said (sang) the appropriate words for the first few notes. The
output of a speech detector could be combined with the pitch detection to give a combined
likelihood value of whether the user was singing with each possible value of the
delay/advance period.
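
The description does not say how the two measures would be combined. The sketch
below shows one possibility only: the pitch error is mapped linearly to a score between
0 and 1 and multiplied by the word probability. The linear mapping, the product rule,
the six-semitone scale and the function names are all assumptions made for
illustration.

```python
from typing import Mapping


def combined_likelihood(
    pitch_error: float, word_probability: float, max_error: float = 6.0
) -> float:
    """Combine a pitch error (semitones) with a speech-detector probability."""
    pitch_score = max(0.0, 1.0 - pitch_error / max_error)  # 1 = perfect pitch
    return pitch_score * word_probability


def best_offset(
    pitch_errors: Mapping[int, float], word_probs: Mapping[int, float]
) -> int:
    """Offset whose combined likelihood is greatest (both maps keyed by offset)."""
    scores = {
        offset: combined_likelihood(pitch_errors[offset], word_probs[offset])
        for offset in pitch_errors
    }
    return max(scores, key=scores.get)
```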

The maximum allowable time offset could be set in response to a game level (e.g.
easy, medium, difficult). At an "easy" game level, a time offset of, say, ±3 note clock
periods might be allowed, whereas at a "difficult" game level, a maximum time offset of,
say, ±1 note clock periods might be allowed.
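
For illustration, the dependence of the search range on the game level may be sketched
as below. The values for "easy" and "difficult" are those suggested above; the value for
"medium" is an assumption, as is the function name.

```python
from typing import List

# Maximum time offset in note clock periods per game level.
# 3 and 1 are the examples given in the text; 2 for "medium" is assumed.
MAX_OFFSET = {"easy": 3, "medium": 2, "difficult": 1}


def candidate_offsets(level: str) -> List[int]:
    """Delay/advance periods to be tested by the correlation test at this level."""
    limit = MAX_OFFSET[level]
    return list(range(-limit, limit + 1))
```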

Figure 4 described the operation of the embodiment as a set of logical blocks, for
clarity of the description. Of course, although the blocks could be implemented in
hardware, or in semi-programmable hardware (e.g. field programmable gate array(s)),
these blocks may conveniently be implemented by parts of the PlayStation2 system shown
in Figure 1 under the control of suitable software. One example of how this may be
achieved is as follows:






    Control logic 800
    Note clock generator 840
    Pitch detector 850
    Registers 860, 880
    Comparator 870
    Scoring logic 890
    Replay logic 820

    Emotion engine 100, accessing:
        DVD/CD interface 450
        SPU 300 (for audio output)
        IOP 700 (for microphone input)
        GS 200 (for video output)


Figure 6 schematically illustrates an example screen display on the television set
305. The lyrics to be sung 900 appear in a horizontally scrolling row, with a current lyric
910 being highlighted. The detection of the current lyric is made by the note clock
stepping through the song elements in the song file, taking into account the number of
clock periods specified in respect of each song element.
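
The stepping operation may be sketched as follows. The song element is modelled here
as a lyric, a midi-note value and a length in note clock periods; that field layout and
the names used are assumptions made for the purposes of the example.

```python
from typing import NamedTuple, Optional, Sequence


class SongElement(NamedTuple):
    lyric: str
    midi_note: int
    length: int  # duration in note clock periods


def current_element_index(
    elements: Sequence[SongElement], tick: int
) -> Optional[int]:
    """Index of the song element current at a given note clock tick.

    Returns None before the first element or after the last one has ended.
    """
    if tick < 0:
        return None
    end = 0
    for index, element in enumerate(elements):
        end += element.length      # tick at which this element finishes
        if tick < end:
            return index
    return None
```

The lyric to be highlighted at a given tick is then the lyric of the element at the
returned index.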

A schematic representation of the pitch to be sung and the length of each note is
given by horizontal bars 920. These are derived from the pitch values (midi-note values)
in the song elements in the song file, but on the screen they do not give a precise measure
of pitch or note length but are arranged so as to indicate the general trend. So, if the next
note is lower than the current note, the bar 920 will appear slightly lower on the screen,
though not necessarily in any proportion to the difference in pitch between the two notes.

Dashed lines 930 schematically indicate the user's attempt to sing each note, as
detected by the pitch detector. If the user's note is flat, the dashed line is drawn lower on
the screen than the line 920 representing the corresponding target note. If the user's note
is sharp, the dashed line will be drawn higher than the line 920. Similarly, late or early
user notes are shown to the right or left of the corresponding line 920. As before, the
positional differences between the lines are schematic rather than in proportion to any
pitch or timing difference.

Figure 7 schematically illustrates a perfectly timed singer. Each target note 940
(shown as notes C, B, C, D, D ..) is sung 950 at exactly the right time.




In Figure 8, the correct sequence of notes, C, B, C, D, D is sung, but they are
sung late. In the absence of the measures described above, all but one of the notes would
be judged to be wrong - and to be significantly sharp or flat.

In Figure 9, the notes are all sung late, but the timing offset 960 is applied so that
the constant degree of lateness is compensated by the registers 860, 880. The user is
given a high score.

In so far as the embodiments of the invention described above are implemented, at
least in part, using software-controlled data processing apparatus, it will be appreciated
that a computer program providing such software control and a storage medium by which
such a computer program is stored are envisaged as aspects of the present invention.






